annotated data
- Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)
- Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)
- Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
- (2 more...)
- Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)
- Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)
- Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
- (2 more...)
Refining Czech GEC: Insights from a Multi-Experiment Approach
Pechman, Petr, Straka, Milan, Straková, Jana, Náplava, Jakub
We present a grammar error correction (GEC) system that achieves state of the art for the Czech language. Our system is based on a neural network translation approach with the Transformer architecture, and its key feature is its real-time synthetic generation pipeline, which dynamically augments sentences with artificial errors by introducing both language-agnostic and Czech-specific errors. We conduct a comprehensive series of experiments, investigating the Czech GEC corpora as bases for synthetic error introduction, several error generation strategies, domain balancing, tokenization granularity, model size, and data scaling during fine-tuning. Additionally, we evaluate the performance of large language models (LLMs) on Czech GEC in both end-user and expert fine-tuning scenarios. Our best-performing model is superior both in performance and computational efficiency. The source code and the trained model links are available on https://github.com/ufal/tsd2025-gec.
- Europe > Italy > Tuscany > Florence (0.04)
- Europe > Czechia > Prague (0.04)
- North America > United States > Maryland > Baltimore (0.04)
- (5 more...)
Self-Improving Model Steering
Zhu, Rongyi, Wang, Yuhui, Jiang, Tanqiu, Liang, Jiacheng, Wang, Ting
Model steering represents a powerful technique that dynamically aligns large language models (LLMs) with human preferences during inference. However, conventional model-steering methods rely heavily on externally annotated data, not only limiting their adaptability to varying contexts but also tethering their effectiveness to annotation quality. In this paper, we present SIMS, the first self-improving model-steering framework that operates without relying on external supervision. At its core, SIMS autonomously generates and refines contrastive samples through iterative self-improvement cycles, enabling adaptive, context-specific steering. Additionally, SIMS employs novel strategies, including prompt ranking and contrast sampling, to further enhance steering efficacy. Extensive evaluation across diverse LLMs and benchmarks demonstrates that SIMS substantially outperforms existing methods in steering effectiveness and adaptability, highlighting self-improving model steering as a promising direction for future research on inference-time LLM alignment.
- North America > United States > Oklahoma (0.04)
- Oceania > Australia (0.04)
- North America > United States > Virginia (0.04)
- (8 more...)
- Media (0.46)
- Leisure & Entertainment (0.46)
Have LLMs Made Active Learning Obsolete? Surveying the NLP Community
Romberg, Julia, Schröder, Christopher, Gonsior, Julius, Tomanek, Katrin, Olsson, Fredrik
Supervised learning relies on annotated data, which is expensive to obtain. A longstanding strategy to reduce annotation costs is active learning, an iterative process, in which a human annotates only data instances deemed informative by a model. Large language models (LLMs) have pushed the effectiveness of active learning, but have also improved methods such as few- or zero-shot learning, and text synthesis - thereby introducing potential alternatives. This raises the question: has active learning become obsolete? To answer this fully, we must look beyond literature to practical experiences. We conduct an online survey in the NLP community to collect previously intangible insights on the perceived relevance of data annotation, particularly focusing on active learning, including best practices, obstacles and expected future developments. Our findings show that annotated data remains a key factor, and active learning continues to be relevant. While the majority of active learning users find it effective, a comparison with a community survey from over a decade ago reveals persistent challenges: setup complexity, estimation of cost reduction, and tooling. We publish an anonymized version of the collected dataset
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- (21 more...)
- Research Report > New Finding (1.00)
- Questionnaire & Opinion Survey (1.00)
- Education (0.93)
- Information Technology > Security & Privacy (0.68)
Automated Annotation of Evolving Corpora for Augmenting Longitudinal Network Data: A Framework Integrating Large Language Models and Expert Knowledge
Liu, Xiao, Wu, Zirui, Li, Jiayi, Shao, Zhicheng, Pang, Xun, Feng, Yansong
Longitudinal network data are essential for analyzing political, economic, and social systems and processes. In political science, these datasets are often generated through human annotation or supervised machine learning applied to evolving corpora. However, as semantic contexts shift over time, inferring dynamic interaction types on emerging issues among a diverse set of entities poses significant challenges, particularly in maintaining timely and consistent annotations. This paper presents the Expert-Augmented LLM Annotation (EALA) approach, which leverages Large Language Models (LLMs) in combination with historically annotated data and expert-constructed codebooks to extrapolate and extend datasets into future periods. We evaluate the performance and reliability of EALA using a dataset of climate negotiations. Our findings demonstrate that EALA effectively predicts nuanced interactions between negotiation parties and captures the evolution of topics over time. At the same time, we identify several limitations inherent to LLM-based annotation, highlighting areas for further improvement. Given the wide availability of codebooks and annotated datasets, EALA holds substantial promise for advancing research in political science and beyond.
- Oceania > New Zealand (0.05)
- Oceania > Australia (0.05)
- Europe > Iceland (0.05)
- (17 more...)
Iterative Auto-Annotation for Scientific Named Entity Recognition Using BERT-Based Models
This paper presents an iterative approach to performing Scientific Named Entity Recognition (SciNER) using BERT - based models. We leverage transfer learning to fine - tune pre - trained models with a small but high - quality set of manually annotated data. The process is iteratively refined by using the fine - tuned model to auto - annotate a larger dataset, followed by additional rounds of fine - tuning. We evaluated two models, dslim/bert - large - NER and bert - large - cased, and found that bert - large - cased consistently outperformed the former. Our approach demonstrated significant improvements in prediction accuracy and F1 sco res, especially for less common entity classes. Future work could include pre - training with unlabeled data, exploring more powerful encoders like RoBERTa, and expanding the scope of manual annotations. This methodology has broader applications in NLP tasks where access to labeled data is limited.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- North America > United States > District of Columbia > Washington (0.04)
Beyond English: Evaluating Automated Measurement of Moral Foundations in Non-English Discourse with a Chinese Case Study
Cheng, Calvin Yixiang, Hale, Scott A
This study explores computational approaches for measuring moral foundations (MFs) in non-English corpora. Since most resources are developed primarily for English, cross-linguistic applications of moral foundation theory remain limited. Using Chinese as a case study, this paper evaluates the effectiveness of applying English resources to machine translated text, local language lexicons, multilingual language models, and large language models (LLMs) in measuring MFs in non-English texts. The results indicate that machine translation and local lexicon approaches are insufficient for complex moral assessments, frequently resulting in a substantial loss of cultural information. In contrast, multilingual models and LLMs demonstrate reliable cross-language performance with transfer learning, with LLMs excelling in terms of data efficiency. Importantly, this study also underscores the need for human-in-the-loop validation of automated MF assessment, as the most advanced models may overlook cultural nuances in cross-language measurements. The findings highlight the potential of LLMs for cross-language MF measurements and other complex multilingual deductive coding tasks.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > Middle East > Republic of Türkiye (0.04)
- North America > Mexico (0.04)
- (8 more...)
Self-Calibrated Dual Contrasting for Annotation-Efficient Bacteria Raman Spectroscopy Clustering and Classification
Yao, Haiming, Luo, Wei, Zhou, Tao, Gao, Ang, Wang, Xue
Raman scattering is based on molecular vibration spectroscopy and provides a powerful technology for pathogenic bacteria diagnosis using the unique molecular fingerprint information of a substance. The integration of deep learning technology has significantly improved the efficiency and accuracy of intelligent Raman spectroscopy (RS) recognition. However, the current RS recognition methods based on deep neural networks still require the annotation of a large amount of spectral data, which is labor-intensive. This paper presents a novel annotation-efficient Self-Calibrated Dual Contrasting (SCDC) method for RS recognition that operates effectively with few or no annotation. Our core motivation is to represent the spectrum from two different perspectives in two distinct subspaces: embedding and category. The embedding perspective captures instance-level information, while the category perspective reflects category-level information. Accordingly, we have implemented a dual contrastive learning approach from two perspectives to obtain discriminative representations, which are applicable for Raman spectroscopy recognition under both unsupervised and semi-supervised learning conditions. Furthermore, a self-calibration mechanism is proposed to enhance robustness. Validation of the identification task on three large-scale bacterial Raman spectroscopy datasets demonstrates that our SCDC method achieves robust recognition performance with very few (5$\%$ or 10$\%$) or no annotations, highlighting the potential of the proposed method for biospectral identification in annotation-efficient clinical scenarios.
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Asia > China > Beijing > Beijing (0.04)
- Health & Medicine > Therapeutic Area > Oncology (1.00)
- Health & Medicine > Pharmaceuticals & Biotechnology (0.88)
Enhanced MRI Representation via Cross-series Masking
Wang, Churan, Gao, Fei, Yan, Lijun, Wang, Siwen, Yu, Yizhou, Wang, Yizhou
Magnetic resonance imaging (MRI) is indispensable for diagnosing and planning treatment in various medical conditions due to its ability to produce multi-series images that reveal different tissue characteristics. However, integrating these diverse series to form a coherent analysis presents significant challenges, such as differing spatial resolutions and contrast patterns meanwhile requiring extensive annotated data, which is scarce in clinical practice. Due to these issues, we introduce a novel Cross-Series Masking (CSM) Strategy for effectively learning MRI representation in a self-supervised manner. Specifically, CSM commences by randomly sampling a subset of regions and series, which are then strategically masked. In the training process, the cross-series representation is learned by utilizing the unmasked data to reconstruct the masked portions. This process not only integrates information across different series but also facilitates the ability to model both intra-series and inter-series correlations and complementarities. With the learned representation, the downstream tasks like segmentation and classification are also enhanced. Taking brain tissue segmentation, breast tumor benign/malignant classification, and prostate cancer diagnosis as examples, our method achieves state-of-the-art performance on both public and in-house datasets.
- Europe > Switzerland (0.04)
- South America > Colombia > Bolivar Department > Cartagena (0.04)
- North America > United States > Virginia > Arlington County > Arlington (0.04)
- (2 more...)
- Health & Medicine > Therapeutic Area > Oncology (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)